Introduction: What is machine learning?

Machine learning is a combination of statistics and computer science that is often defined as an algorithm, computer, or other machine that can learn on its own as it is given more data and without human input. This definition easily becomes conflated with opaque ideas of “artificial intelligence” and is too frequently associated with romanticized technological advancements in transportation, robotics, cognition, and human-computer interaction - although it does apply to these fields.

However, we thankfully do not need to think of machine learning in such complicated terms. Instead, it can be thought of as a toolbox for exploring data for application to research problems in virtually all fields of study. We can think of machine learning as a

“vast set of tools for understanding data. These tools can be classified as supervised or unsupervised. Broadly speaking, supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs… With unsupervised statistical learning, there are inputs but no supervising output; nevertheless we can learn relationships and structure from such data.” (James et al. 2021, p. 1).

James G, Witten D, Hastie T, Tibshirani R. 2021. An Introduction to Statistical Learning: With Applications in R, 2nd edition.

Machine learning terminology

Machine learning tasks are generally defined as belonging to one of a small handful of groups: supervised, unsupervised, deep, semi-supervised, reinforcement, and targeted/causal.

Our focus in this workshop is on supervised learning for classification using the SuperLearner R package and we will follow the Guide to SuperLearner.

  1. Supervised learning: is a predictive technique that trains a model on known, labeled data. The goal is to understand the relationships between variables in a dataset so that the trained model can be used to make predictions on new data it has not yet seen and whose labels are unknown. Or,

“We wish to fit a model that relates the response to the predictors, with the aim of accurately predicting the response for future observations (prediction) or better understanding the relationship between the response and the predictors (inference).” (James et al. 2021, p. 26)

  • Classification: is the supervised task when the Y variable is discrete/categorical.

  • Regression: is the supervised task when the Y variable is continuous (i.e., numeric or integer).

  1. Supervised syntax

\(Y ~ X1 + X2 + X3 ... + Xn\)

Simply put, we want to use the X variables to predict Y.

  1. X and Y variables:
  • X is/are the independent variable(s) that we use to do the predicting. These are also referred to as features, covariates, predictors, and explanatory/input variables.

  • Y is the dependent variable and the one we want to predict. It is also referred to as the outcome, target, or response variable. Although predicting the label itself might be convenient, predicting the class probabilities is more efficient.

  1. Data splitting: Predictions are usually evaluated twice: 1) first on the training set to see how well the model can learn the relationships between the X and Y variables, and then 2) on the test set to see how well the trained model can generalize to predicting on new data. To accomplish this task, a given dataset is divided into training and test sets.

NOTE: a validation set is also sometimes used for hyperparameter tuning/model selection on the training dataset.

  • The training set: generally consists of the majority portion of the original dataset (70%, for example) where the model can learn the relationships between the X and Y variables.

    • Hyperparameters: are the configurations manually set by the programmer prior to model training through heuristics and trial and error. We do not know the optimal combinations of these options, and must tune the hyperparameters through the model training process to optimize prediction performance.
  • The test set: consists of the remaining portion of the dataset (30% in this example) that the trained model will then try to predict without seeing the Y labels.

  • k-fold cross-validation: is a preferred method for approaching the data splitting process because it repeats the train/test split process “k” number of times and rotates portions of the dataset to ensure that each observation is in the test set at least once.

  1. Performance metrics: It is necessary to evaluate model performance on the training and test sets (and validation set, when applicable) through a variety of confusion matrix derivations.

We will focus on two classification metrics in this workshop: 1. Risk - as measured by mean squared error 2. AUC-ROC - area under the curve - receiver operator characteristic; a measure of true positive rate versus true negative rate

  • A model is underfit if it cannot learn the relationships between the X and Y variables on the training set.

  • A model is overfit if it adequately learns the relationships between the X and Y variables and performs well on the training set, but performs poorly on the test set.

Example machine learning workflow

  1. Read literature in your field to understand what has been done; annotate what you read
  2. Formulate a research question(s)
  3. Obtain data
  4. Preprocess data
  • scale variables when necessary
  • handle missing values if present (listwise delete, median impute, generalized low rank model impute, etc.)
  • convert factor variables to numeric indicators if present
  1. Define x and y variables
  2. Split data into train and test sets
  3. Train and evaluate performance of a single algorithm on as a prototype
  4. Examine the trained model’s performance on the test set
  5. Create and deploy a cross-validated ensemble

Today’s topic: Predicting heart disease

https://theheartfoundation.org/heart-disease-facts-2/ We investigate how well individual algorithms and a SuperLearner ensemble weighted average can predict heart disease (yes/no) using other health indicators as features/predictors. Learn more about the dataset at the UCI Machine Learning Repository

SuperLearner

SuperLearner is an algorithm that uses cross-validation to estimate the performance of multiple machine learning models, or the same model with different settings. It then creates an optimal weighted average of those models, aka an “ensemble”, using the test data performance. This approach has been proven to be asymptotically as accurate as the best possible prediction algorithm that is tested. Guide to SuperLearner - 1 Background

In this manner, a SuperLearner ensemble is a powerful tool because it:

  1. Eliminates bias of single algorithm selection for framing a research problem

  2. Allows for comparison of multiple algorithms, and/or comparison of the same model but tuned in many different ways

  3. Utilizes a second-level algorithm that produces an ideal weighted prediction that is suitable for data of virtually all distributions and uses external cross-validation to prevent overfitting.

Read the papers:

Setup

Package installation

Install and library the packages we will use in this workshop. The pacman package management tool handles everything for you in the code chunk below.

You should consider using it (or something like it) for your own research. Read the documentation here.

if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load("caret",         # create stratified random split of the dataset
               "ck37r",         # Chris Kennedy's Machine Learning Helper Toolkit
               "ggplot2",       # visualize risk and ROC
               "glmnet",        # the elastic net algorithm
               "ranger",        # random forest algorithm
               "rpart",         # decision tree algorithm
               "scales",        # for scaling data
               "SuperLearner",  # fit individual algorithms and ensembles
               "ROCR",          # compute AUC-ROC performance
               "xgboost")       # boosted tree algorithm

The ck37r package

If the CRAN installation for the ck37r package does not work correctly, install it from GitHub by unhashtagging the three lines of code below.

# install.packages("remotes")
# remotes::install_github("ck37/ck37r")
# library(ck37r)

Import and preprocess the data

Import and preprocess the data in six steps, and create a new variable for each step:

  1. Import the dataset: The variable raw contains data from the raw .csv file.

  2. Convert categorical variables: raw_fac is used to convert categorical variables to factor type.

  3. Convert factors to numeric indicators: raw_df will be for conversion of factors to numeric indicators.

  4. Identify and remove variables to be scaled: removed consists of the preprocessed data excluding the continuous variables to be scaled.

  5. Scale the continuous variables: rescaled is the variable for the scaled variables.

  6. Produce clean dataset: clean is the final merged dataset; a combination of the variables removed and rescaled.

  7. Save the clean dataset: using the save function. You can load it with the load function so you do not have to repeat these preprocessing steps again.

1. Import the dataset

Read in the raw .csv file and save it in a variable named raw.

raw <- read.csv("data/raw/heart.csv")
str(raw)
## 'data.frame':    303 obs. of  14 variables:
##  $ age     : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp      : int  3 2 1 1 0 0 1 1 2 2 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : int  0 0 2 2 2 1 1 2 2 2 ...
##  $ ca      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal    : int  1 2 2 2 2 1 2 3 3 2 ...
##  $ target  : int  1 1 1 1 1 1 1 1 1 1 ...

2. Convert categorical variables

The variables cp, restecg, slope, ca, and thal are variables to be converted to nominal categorical type. Perhaps ordinal factors could be even more appropriate.

Save this factorized version in a variable named raw_fac.

raw_fac <- ck37r::categoricals_to_factors(data = raw, 
                                          categoricals = c("cp", "restecg", "slope", "ca", "thal"))
str(raw_fac)
## 'data.frame':    303 obs. of  14 variables:
##  $ age     : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp      : Factor w/ 4 levels "0","1","2","3": 4 3 2 2 1 1 2 2 3 3 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : Factor w/ 3 levels "0","1","2": 1 2 1 2 2 2 1 2 2 2 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : Factor w/ 3 levels "0","1","2": 1 1 3 3 3 2 2 3 3 3 ...
##  $ ca      : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ thal    : Factor w/ 4 levels "0","1","2","3": 2 3 3 3 3 2 3 4 4 3 ...
##  $ target  : int  1 1 1 1 1 1 1 1 1 1 ...

3. Convert factors to numeric indicators

Most machine learning algorithms require that input and output variables are numeric and therefore do not handle factor data well (although with some exceptions).

Therefore, we want to convert the factor variables to numeric indicator variables (aka one-hot/dummy coding). See a few examples here.

Name this variable raw_df.

# Save this as an intermediate variable named raw_ind 
raw_ind <- ck37r::factors_to_indicators(data = raw_fac, 
                                        verbose = TRUE)
## Converting factors (5): cp, restecg, slope, ca, thal
## Converting cp from a factor to a matrix (4 levels).
## : cp_1 cp_2 cp_3 
## Converting restecg from a factor to a matrix (3 levels).
## : restecg_1 restecg_2 
## Converting slope from a factor to a matrix (3 levels).
## : slope_1 slope_2 
## Converting ca from a factor to a matrix (5 levels).
## : ca_1 ca_2 ca_3 ca_4 
## Converting thal from a factor to a matrix (4 levels).
## : thal_1 thal_2 thal_3 
## Combining factor matrices into a data frame.
# Extract the actual data portion from raw_ind and save as raw_df
raw_df <- raw_ind$data
str(raw_df)
## 'data.frame':    303 obs. of  23 variables:
##  $ age      : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex      : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ trestbps : int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol     : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs      : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ thalach  : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang    : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak  : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ target   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ cp_1     : int  0 0 1 1 0 0 1 1 0 0 ...
##  $ cp_2     : int  0 1 0 0 0 0 0 0 1 1 ...
##  $ cp_3     : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ restecg_1: int  0 1 0 1 1 1 0 1 1 1 ...
##  $ restecg_2: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ slope_1  : int  0 0 0 0 0 1 1 0 0 0 ...
##  $ slope_2  : int  0 0 1 1 1 0 0 1 1 1 ...
##  $ ca_1     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca_2     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca_3     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca_4     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal_1   : int  1 0 0 0 0 1 0 0 0 0 ...
##  $ thal_2   : int  0 1 1 1 1 0 1 0 0 1 ...
##  $ thal_3   : int  0 0 0 0 0 0 0 1 1 0 ...

4. Identify and remove variables to be scaled

Identify and remove the variables age, trestbps, chol, thalach, and oldpeak to be scaled.

# First, investigate the data to identify variables to scale
names(raw_df)
##  [1] "age"       "sex"       "trestbps"  "chol"      "fbs"       "thalach"  
##  [7] "exang"     "oldpeak"   "target"    "cp_1"      "cp_2"      "cp_3"     
## [13] "restecg_1" "restecg_2" "slope_1"   "slope_2"   "ca_1"      "ca_2"     
## [19] "ca_3"      "ca_4"      "thal_1"    "thal_2"    "thal_3"
summary(raw_df)
##       age             sex            trestbps          chol      
##  Min.   :29.00   Min.   :0.0000   Min.   : 94.0   Min.   :126.0  
##  1st Qu.:47.50   1st Qu.:0.0000   1st Qu.:120.0   1st Qu.:211.0  
##  Median :55.00   Median :1.0000   Median :130.0   Median :240.0  
##  Mean   :54.37   Mean   :0.6832   Mean   :131.6   Mean   :246.3  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:140.0   3rd Qu.:274.5  
##  Max.   :77.00   Max.   :1.0000   Max.   :200.0   Max.   :564.0  
##       fbs            thalach          exang           oldpeak    
##  Min.   :0.0000   Min.   : 71.0   Min.   :0.0000   Min.   :0.00  
##  1st Qu.:0.0000   1st Qu.:133.5   1st Qu.:0.0000   1st Qu.:0.00  
##  Median :0.0000   Median :153.0   Median :0.0000   Median :0.80  
##  Mean   :0.1485   Mean   :149.6   Mean   :0.3267   Mean   :1.04  
##  3rd Qu.:0.0000   3rd Qu.:166.0   3rd Qu.:1.0000   3rd Qu.:1.60  
##  Max.   :1.0000   Max.   :202.0   Max.   :1.0000   Max.   :6.20  
##      target            cp_1            cp_2             cp_3        
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :1.0000   Median :0.000   Median :0.0000   Median :0.00000  
##  Mean   :0.5446   Mean   :0.165   Mean   :0.2871   Mean   :0.07591  
##  3rd Qu.:1.0000   3rd Qu.:0.000   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.000   Max.   :1.0000   Max.   :1.00000  
##    restecg_1        restecg_2         slope_1         slope_2      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :1.0000   Median :0.0000   Median :0.000   Median :0.0000  
##  Mean   :0.5017   Mean   :0.0132   Mean   :0.462   Mean   :0.4686  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.000   Max.   :1.0000  
##       ca_1             ca_2             ca_3              ca_4       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.2145   Mean   :0.1254   Mean   :0.06601   Mean   :0.0165  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  
##      thal_1            thal_2           thal_3      
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.00000   Median :1.0000   Median :0.0000  
##  Mean   :0.05941   Mean   :0.5479   Mean   :0.3861  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000
summary(raw_df[,c("age", "trestbps", "chol", "thalach", "oldpeak")])
##       age           trestbps          chol          thalach         oldpeak    
##  Min.   :29.00   Min.   : 94.0   Min.   :126.0   Min.   : 71.0   Min.   :0.00  
##  1st Qu.:47.50   1st Qu.:120.0   1st Qu.:211.0   1st Qu.:133.5   1st Qu.:0.00  
##  Median :55.00   Median :130.0   Median :240.0   Median :153.0   Median :0.80  
##  Mean   :54.37   Mean   :131.6   Mean   :246.3   Mean   :149.6   Mean   :1.04  
##  3rd Qu.:61.00   3rd Qu.:140.0   3rd Qu.:274.5   3rd Qu.:166.0   3rd Qu.:1.60  
##  Max.   :77.00   Max.   :200.0   Max.   :564.0   Max.   :202.0   Max.   :6.20
# Save this as an intermediate variable named to_scale
to_scale <- raw_df[,c("age", "trestbps", "chol", "thalach", "oldpeak")]

Also see ?ck37r::standardize for a simple extension

Remove the variables to be scaled and save the remaining variables in a dataframe named removed.

removed <- raw_df[,-c(1, # age
                      3, # trestbps
                      4, # chol
                      6, # thalach
                      8  # oldpeak
                      )]

5. Scale the continuous variables

Rescale these variables to a rang of 0 to 1 in a variable named rescaled.

rescaled <- as.data.frame(rescale(as.matrix(to_scale), to = c(0,1), ncol = 1))
summary(rescaled)
##       age             trestbps           chol           thalach      
##  Min.   :0.05142   Min.   :0.1667   Min.   :0.2234   Min.   :0.1259  
##  1st Qu.:0.08422   1st Qu.:0.2128   1st Qu.:0.3741   1st Qu.:0.2367  
##  Median :0.09752   Median :0.2305   Median :0.4255   Median :0.2713  
##  Mean   :0.09639   Mean   :0.2334   Mean   :0.4366   Mean   :0.2653  
##  3rd Qu.:0.10816   3rd Qu.:0.2482   3rd Qu.:0.4867   3rd Qu.:0.2943  
##  Max.   :0.13652   Max.   :0.3546   Max.   :1.0000   Max.   :0.3582  
##     oldpeak        
##  Min.   :0.000000  
##  1st Qu.:0.000000  
##  Median :0.001418  
##  Mean   :0.001843  
##  3rd Qu.:0.002837  
##  Max.   :0.010993

6. Produce clean dataset

Finally, recombine the original data with the scaled variables for the final clean dataset.

clean <- cbind(removed,       # cleaned data without the scaled variables
               rescaled       # scaled variables
               )

# The scaled variables are the last five!
str(clean)
## 'data.frame':    303 obs. of  23 variables:
##  $ sex      : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ fbs      : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ exang    : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ target   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ cp_1     : int  0 0 1 1 0 0 1 1 0 0 ...
##  $ cp_2     : int  0 1 0 0 0 0 0 0 1 1 ...
##  $ cp_3     : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ restecg_1: int  0 1 0 1 1 1 0 1 1 1 ...
##  $ restecg_2: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ slope_1  : int  0 0 0 0 0 1 1 0 0 0 ...
##  $ slope_2  : int  0 0 1 1 1 0 0 1 1 1 ...
##  $ ca_1     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca_2     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca_3     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca_4     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal_1   : int  1 0 0 0 0 1 0 0 0 0 ...
##  $ thal_2   : int  0 1 1 1 1 0 1 0 0 1 ...
##  $ thal_3   : int  0 0 0 0 0 0 0 1 1 0 ...
##  $ age      : num  0.1117 0.0656 0.0727 0.0993 0.1011 ...
##  $ trestbps : num  0.257 0.23 0.23 0.213 0.213 ...
##  $ chol     : num  0.413 0.443 0.362 0.418 0.628 ...
##  $ thalach  : num  0.266 0.332 0.305 0.316 0.289 ...
##  $ oldpeak  : num  0.00408 0.00621 0.00248 0.00142 0.00106 ...

Save the clean dataset

Save the clean dataset so you do not have to repeat these preprocessing steps.

save(clean,                                  # variable(s) to be saved
     file = "data/preprocessed/clean.RData") # file name

# You can always load the clean dataset with
load("data/preprocessed/clean.RData")

Stay organized

One handy way to keep your machine learning organized is by storing all the components in a list. Let’s call our list heart_class, for our task of heart disease classification.

Our first two list components will be the dataset and outcome variable. In the heart dataset, the target variable is coded with a 1 if the patient had heart disease and a 0 if they did not.

heart_class <- list(
  data = clean, 
  outcome = "target"
)

# View the list contents
names(heart_class)
## [1] "data"    "outcome"
# Access part of the list with the dollar sign
head(heart_class$data)
##   sex fbs exang target cp_1 cp_2 cp_3 restecg_1 restecg_2 slope_1 slope_2 ca_1
## 1   1   1     0      1    0    0    1         0         0       0       0    0
## 2   1   0     0      1    0    1    0         1         0       0       0    0
## 3   0   0     0      1    1    0    0         0         0       0       1    0
## 4   1   0     0      1    1    0    0         1         0       0       1    0
## 5   0   0     1      1    0    0    0         1         0       0       1    0
## 6   1   0     0      1    0    0    0         1         0       1       0    0
##   ca_2 ca_3 ca_4 thal_1 thal_2 thal_3        age  trestbps      chol   thalach
## 1    0    0    0      1      0      0 0.11170213 0.2570922 0.4131206 0.2659574
## 2    0    0    0      0      1      0 0.06560284 0.2304965 0.4432624 0.3315603
## 3    0    0    0      0      1      0 0.07269504 0.2304965 0.3617021 0.3049645
## 4    0    0    0      0      1      0 0.09929078 0.2127660 0.4184397 0.3156028
## 5    0    0    0      0      1      0 0.10106383 0.2127660 0.6276596 0.2890071
## 6    0    0    0      1      0      0 0.10106383 0.2482270 0.3404255 0.2624113
##        oldpeak
## 1 0.0040780142
## 2 0.0062056738
## 3 0.0024822695
## 4 0.0014184397
## 5 0.0010638298
## 6 0.0007092199
heart_class$outcome
## [1] "target"

Add covariates

heart_class$covariates <- setdiff(names(heart_class$data), heart_class$outcome)

# The covariate names appear in the heart_class list!
names(heart_class)
## [1] "data"       "outcome"    "covariates"
# Call the covariates with (notice that "target" is excluded)
heart_class$covariates
##  [1] "sex"       "fbs"       "exang"     "cp_1"      "cp_2"      "cp_3"     
##  [7] "restecg_1" "restecg_2" "slope_1"   "slope_2"   "ca_1"      "ca_2"     
## [13] "ca_3"      "ca_4"      "thal_1"    "thal_2"    "thal_3"    "age"      
## [19] "trestbps"  "chol"      "thalach"   "oldpeak"

Add training rows

set.seed(1) 
training_rows <- 
  caret::createDataPartition(heart_class$data[[heart_class$outcome]], 
                             p = 0.70, 
                             list = FALSE)

heart_class$train_rows <- training_rows

# We have added the training rows to our list
names(heart_class)
## [1] "data"       "outcome"    "covariates" "train_rows"
head(heart_class$train_rows)
##      Resample1
## [1,]         1
## [2,]         2
## [3,]         4
## [4,]         5
## [5,]         6
## [6,]         9

Split data into training and test sets

x_train <- heart_class$data[heart_class$train_rows, heart_class$covariates]
y_train <- heart_class$data[heart_class$train_rows, heart_class$outcome]

x_test <- heart_class$data[-heart_class$train_rows, heart_class$covariates]
y_test <- heart_class$data[-heart_class$train_rows, heart_class$outcome]

# Mean and frequencies of training set
mean(heart_class$data[training_rows, "target"])
## [1] 0.5446009
table(heart_class$data[training_rows, "target"])
## 
##   0   1 
##  97 116
# Mean and frequencies of test set
mean(heart_class$data[-training_rows, "target"])
## [1] 0.5444444
table(heart_class$data[-training_rows, "target"])
## 
##  0  1 
## 41 49
# Add these variables to our list for safekeeping
heart_class$x_train <- x_train
heart_class$y_train <- y_train
heart_class$x_test <- x_test
heart_class$y_test <- y_test
names(heart_class)
## [1] "data"       "outcome"    "covariates" "train_rows" "x_train"   
## [6] "y_train"    "x_test"     "y_test"
# For example,
heart_class$y_train
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [149] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# Also save X and Y for the ensemble
heart_class$Y <- clean$target
heart_class$X <- clean[,-4] 

Save the list heart_class

This way, you do not have to repeat all of these steps. You can just load("data/preprocessed/heart_class.RData")

names(heart_class)
##  [1] "data"       "outcome"    "covariates" "train_rows" "x_train"   
##  [6] "y_train"    "x_test"     "y_test"     "Y"          "X"
save(heart_class,
     file = "data/preprocessed/heart_class.RData")

# Load it with the load() function
load("data/preprocessed/heart_class.RData")

Fit single models

It is often good to fit a single model to prototype your machine learning goals.

Let’s fit three! Click the links for more information.

  1. Single decision tree

  2. Random forest

  3. Lasso

View available models

SuperLearner supports a wide variety of useful algorithms:

listWrappers()
## All prediction algorithm wrappers in SuperLearner:
##  [1] "SL.bartMachine"      "SL.bayesglm"         "SL.biglasso"        
##  [4] "SL.caret"            "SL.caret.rpart"      "SL.cforest"         
##  [7] "SL.earth"            "SL.extraTrees"       "SL.gam"             
## [10] "SL.gbm"              "SL.glm"              "SL.glm.interaction" 
## [13] "SL.glmnet"           "SL.ipredbagg"        "SL.kernelKnn"       
## [16] "SL.knn"              "SL.ksvm"             "SL.lda"             
## [19] "SL.leekasso"         "SL.lm"               "SL.loess"           
## [22] "SL.logreg"           "SL.mean"             "SL.nnet"            
## [25] "SL.nnls"             "SL.polymars"         "SL.qda"             
## [28] "SL.randomForest"     "SL.ranger"           "SL.ridge"           
## [31] "SL.rpart"            "SL.rpartPrune"       "SL.speedglm"        
## [34] "SL.speedlm"          "SL.step"             "SL.step.forward"    
## [37] "SL.step.interaction" "SL.stepAIC"          "SL.svm"             
## [40] "SL.template"         "SL.xgboost"
## 
## All screening algorithm wrappers in SuperLearner:
## [1] "All"
## [1] "screen.corP"           "screen.corRank"        "screen.glmnet"        
## [4] "screen.randomForest"   "screen.SIS"            "screen.template"      
## [7] "screen.ttest"          "write.screen.template"

Decision tree

A decision tree is an algorithm is useful because they are relatively simple to construct and interpret. They can work with categorical data directly, without the need for one-hot encoding.

Decision trees work by recursively partitioning, or partitioning, the feature space into smaller regions that contain less data. Splitting takes place at nodes and each directional split is called a branch. The top node is called the root node and seeks to find the most discriminating split, thus always putting the most optimal partition first.

Keep in mind that single decision trees are prone to overfitting and high variance.

The tree stops partitioning when it fails to meet some hyperparameter condition. See ?rpart.control for a list of tunings.

Unhashtag the below code to reproduce the tree figure for the training data.

?rpart.control
# install.packages("rpart.plot")
# library(rpart.plot)
# (dt_example_plot <- rpart(y_train ~ ., 
#                           data = x_train, 
#                           method = "class",
#                           parms = list(split = "information")))
# rpart.plot(dt_example_plot)

training set decision tree plot

Train the model with SuperLearner

set.seed(1)
dt <- SuperLearner(Y = heart_class$y_train, 
                   X = heart_class$x_train, 
                   family = binomial(),
                   SL.library = "SL.rpart")
dt
## 
## Call:  
## SuperLearner(Y = heart_class$y_train, X = heart_class$x_train, family = binomial(),  
##     SL.library = "SL.rpart") 
## 
## 
##                   Risk Coef
## SL.rpart_All 0.1743429    1
# Accuracy
(dt_acc <- 1 - dt$cvRisk)
## SL.rpart_All 
##    0.8256571

Test set AUC-ROC:

# Predict on test set
dt_predictions = predict(dt, heart_class$x_test, onlySL = TRUE)
str(dt_predictions)
## List of 2
##  $ pred           : num [1:90, 1] 0.917 0.917 0.652 0.917 0.917 ...
##  $ library.predict: num [1:90, 1] 0.917 0.917 0.652 0.917 0.917 ...
summary(dt_predictions$library.predict)
##        V1         
##  Min.   :0.09091  
##  1st Qu.:0.11111  
##  Median :0.65217  
##  Mean   :0.59937  
##  3rd Qu.:0.91667  
##  Max.   :0.91667
# Visualize predicted probabilities
qplot(dt_predictions$pred[, 1]) + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Compute AUC-ROC and save as variable
dt_rocr_score = ROCR::prediction(dt_predictions$pred, heart_class$y_test)
dt_auc_roc = ROCR::performance(dt_rocr_score, measure = "auc", x.measure = "cutoff")@y.values[[1]]

# AUC-ROC
dt_auc_roc
## [1] 0.7366849

Random forest

The random forest algorithm is a big step up from a single decision tree.

It is a forest because it grows multiple decision trees.

It is random because it bootstrap samples (with replacement) two-thirds of the dataset for each tree (this is called bootstrap aggregating, or “bagging”) and evaluates performance on the remaining one-third of the data (called the “out-of-bag” sample).

It is also random because instead of trying to place the most discriminating split as the root node, it selects a random sample of features to try at each split. This allows for good splits to be followed by bad splits, and for bad splits to be followed by good splits thus helping decorrelate the average performance estimator for a more robust conclusion.

See ?ranger for hyperparameter tuning options.

?ranger

Train the model

set.seed(1)
rf <- SuperLearner(Y = heart_class$y_train, 
                   X = heart_class$x_train, 
                   family = binomial(),
                   SL.library = "SL.ranger")
rf
## 
## Call:  
## SuperLearner(Y = heart_class$y_train, X = heart_class$x_train, family = binomial(),  
##     SL.library = "SL.ranger") 
## 
## 
##                    Risk Coef
## SL.ranger_All 0.1293912    1
# Accuracy
(rf_acc <- 1 - rf$cvRisk)
## SL.ranger_All 
##     0.8706088

Test set AUC-ROC:

rf_predictions = predict(rf, heart_class$x_test, onlySL = TRUE)
str(rf_predictions)
## List of 2
##  $ pred           : num [1:90, 1] 0.97 0.762 0.566 0.901 0.941 ...
##  $ library.predict: num [1:90, 1] 0.97 0.762 0.566 0.901 0.941 ...
summary(rf_predictions$library.predict)
##        V1          
##  Min.   :0.008188  
##  1st Qu.:0.337193  
##  Median :0.619751  
##  Mean   :0.592442  
##  3rd Qu.:0.874289  
##  Max.   :0.992931
# Visualize predicted probabilities
qplot(rf_predictions$pred[, 1]) + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Compute AUC-ROC and save as variable
rf_rocr_score = ROCR::prediction(rf_predictions$pred, heart_class$y_test)
rf_auc_roc = ROCR::performance(rf_rocr_score, measure = "auc", x.measure = "cutoff")@y.values[[1]]

# AUC-ROC 
rf_auc_roc
## [1] 0.8596317

Lasso

The lasso (least absolute square shrinkage operator) is a form of penalized regression and classification that shrinks the beta coefficients to zero of features that are not related to the outcome variable.

We can follow the “one standard error rule”, which allows us to select a higher lambda value (the regularization parameter) within one standard error of the minimum value and sacrifice a little error for a model that contains less variables but is ostensibly easier to interpret.

View the help files with ?glmnet and ?cv.glmnet to learn more.

?glmnet
?cv.glmnet
set.seed(1)
lasso <- cv.glmnet(x = as.matrix(heart_class$X), 
                   y = heart_class$Y, 
                   family = "binomial", 
                   alpha = 1)
lasso
## 
## Call:  cv.glmnet(x = as.matrix(heart_class$X), y = heart_class$Y, family = "binomial",      alpha = 1) 
## 
## Measure: Binomial Deviance 
## 
##      Lambda Index Measure      SE Nonzero
## min 0.01012    36  0.7631 0.06686      18
## 1se 0.03392    23  0.8285 0.04659      15
# View path plot
plot(lasso)

# View coefficients
plot(lasso$glmnet.fit, xvar = "lambda", label = TRUE)

# Show coefficients for 1se model
(coef_min = coef(lasso, s = "lambda.1se"))
## 23 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)   -0.5608427
## sex           -0.4290884
## fbs            .        
## exang         -0.6548822
## cp_1           0.1442182
## cp_2           0.6982435
## cp_3           0.3563876
## restecg_1      0.1032213
## restecg_2      .        
## slope_1       -0.3557496
## slope_2        0.1325356
## ca_1          -0.8761617
## ca_2          -1.2264879
## ca_3          -0.7966503
## ca_4           .        
## thal_1         .        
## thal_2         0.6554511
## thal_3        -0.5795217
## age            .        
## trestbps       .        
## chol           .        
## thalach        6.1897480
## oldpeak     -165.2682982

Lasso - train the model

set.seed(1)
la <- SuperLearner(Y = heart_class$y_train, 
                   X = heart_class$x_train, 
                   family = binomial(),
                   SL.library = "SL.glmnet")
la
## 
## Call:  
## SuperLearner(Y = heart_class$y_train, X = heart_class$x_train, family = binomial(),  
##     SL.library = "SL.glmnet") 
## 
## 
##                    Risk Coef
## SL.glmnet_All 0.1178595    1
# 0.86 accuracy
(la_risk <- 1-la$cvRisk)
## SL.glmnet_All 
##     0.8821405

Lasso test set AUC-ROC:

la_predictions = predict(la, heart_class$x_test, onlySL = TRUE)
str(la_predictions)
## List of 2
##  $ pred           : num [1:90, 1] 0.916 0.74 0.749 0.859 0.965 ...
##  $ library.predict: num [1:90, 1] 0.916 0.74 0.749 0.859 0.965 ...
summary(la_predictions$library.predict)
##        V1          
##  Min.   :0.008801  
##  1st Qu.:0.299391  
##  Median :0.633026  
##  Mean   :0.580456  
##  3rd Qu.:0.903609  
##  Max.   :0.988807
# Visualize predicted probabilities
qplot(la_predictions$pred[, 1]) + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Compute AUC-ROC and save as variable
la_rocr_score = ROCR::prediction(la_predictions$pred, heart_class$y_test)
la_auc_roc = ROCR::performance(la_rocr_score, measure = "auc", x.measure = "cutoff")@y.values[[1]]

# AUC-ROC 
la_auc_roc
## [1] 0.8939771

Ensemble

Now, lets fit an ensemble with two more algorithms include:

  1. xgboost

  2. The mean of Y

Summary of algorithm definitions

Train the ensemble of models

set.seed(1)
cv_sl <- CV.SuperLearner(Y = heart_class$Y,  # heart disease yes/no
                         X = heart_class$X,  # excluding the "target" variable
                         family = binomial(),
                         # For your own research, 5, 10, or 20 are good options
                         cvControl = list(V = 5), 
                         innerCvControl = list(list(V = 5)),
                         verbose = FALSE, 
                         method = "method.NNLS",
                         SL.library = c("SL.rpart",   # decision tree
                                        "SL.ranger",  # random forest
                                        "SL.glmnet",  # lasso
                                        "SL.xgboost", # boosted trees
                                        "SL.mean"))   # Y mean
## Warning in CV.SuperLearner(Y = heart_class$Y, X = heart_class$X, family =
## binomial(), : Only a single innerCvControl is given, will be replicated across
## all cross-validation split calls to SuperLearner

Explore the output

# Learn more about what is contained within the model
cv_sl
## 
## Call:  
## CV.SuperLearner(Y = heart_class$Y, X = heart_class$X, family = binomial(),  
##     SL.library = c("SL.rpart", "SL.ranger", "SL.glmnet", "SL.xgboost", "SL.mean"),  
##     method = "method.NNLS", verbose = FALSE, cvControl = list(V = 5), innerCvControl = list(list(V = 5))) 
## 
## 
## 
## Cross-validated predictions from the SuperLearner:  SL.predict 
## 
## Cross-validated predictions from the discrete super learner (cross-validation selector):  discreteSL.predict 
## 
## Which library algorithm was the discrete super learner:  whichDiscreteSL 
## 
## Cross-validated prediction for all algorithms in the library:  library.predict
# View the risk table
summary(cv_sl)
## 
## Call:  
## CV.SuperLearner(Y = heart_class$Y, X = heart_class$X, family = binomial(),  
##     SL.library = c("SL.rpart", "SL.ranger", "SL.glmnet", "SL.xgboost", "SL.mean"),  
##     method = "method.NNLS", verbose = FALSE, cvControl = list(V = 5), innerCvControl = list(list(V = 5))) 
## 
## 
## Risk is based on: Mean Squared Error
## 
## All risk estimates are based on V =  5 
## 
##       Algorithm     Ave        se      Min     Max
##   Super Learner 0.11722 0.0113943 0.088134 0.15460
##     Discrete SL 0.11607 0.0116867 0.084784 0.15543
##    SL.rpart_All 0.18315 0.0167049 0.142558 0.23328
##   SL.ranger_All 0.13734 0.0105342 0.122008 0.16847
##   SL.glmnet_All 0.11607 0.0116867 0.084784 0.15543
##  SL.xgboost_All 0.14758 0.0139655 0.110688 0.19415
##     SL.mean_All 0.24944 0.0026639 0.244474 0.25555
# View the discrete winner
table(simplify2array(cv_sl$whichDiscreteSL))
## 
## SL.glmnet_All 
##             5
# View the AUC-ROC table for: 
# 1. The individual algorithms
# 2. The discrete winner
# 3. The SuperLearner ensemble
ck37r::auc_table(cv_sl)
##                      auc         se  ci_lower  ci_upper      p-value
## SL.mean_All    0.5000000 0.05767758 0.3869540 0.6130460 4.666398e-13
## SL.rpart_All   0.7753934 0.03589421 0.7050420 0.8457447 7.213063e-05
## SL.xgboost_All 0.8657012 0.02082595 0.8248831 0.9065194 1.340076e-02
## SL.ranger_All  0.8851278 0.01857296 0.8487255 0.9215302 7.535160e-02
## SuperLearner   0.9100506 0.01667676 0.8773648 0.9427365 4.577962e-01
## SL.glmnet_All  0.9118182 0.01638189 0.8797102 0.9439261 5.000000e-01
## DiscreteSL     0.9118182 0.01638189 0.8797102 0.9439261 5.000000e-01
# Visualize the cross-validated risk
plot.CV.SuperLearner(cv_sl) + theme_bw()

# Print the weights table
cvsl_weights(cv_sl)
##   # Learner    Mean      SD     Min     Max
## 1 1  glmnet 0.84904 0.08951 0.77596 1.00000
## 2 2  ranger 0.13957 0.08266 0.00000 0.22044
## 3 3   rpart 0.00811 0.01620 0.00000 0.03695
## 4 4 xgboost 0.00328 0.00733 0.00000 0.01638
## 5 5    mean 0.00000 0.00000 0.00000 0.00000
# Compute ensemble AUC-ROC 
ck37r::cvsl_auc(cv_sl)
## $cvAUC
## [1] 0.9100506
## 
## $se
## [1] 0.01667676
## 
## $ci
## [1] 0.8773648 0.9427365
## 
## $confidence
## [1] 0.95
# Plot ROC curve
ck37r::plot_roc(cv_sl)

# View rough estimates of variable importance
set.seed(1)
var_importance <- ck37r::vim_corr(covariates = heart_class$covariates, 
                                  outcome = heart_class$outcome, 
                                  data = clean, 
                                  bootse = FALSE, 
                                  verbose = TRUE)
var_importance
##     rank  variable        corr      p_value  p_value_fdr     avg_con
## 015    1    thal_2  0.52733355 4.356225e-23 9.583695e-22 0.260869565
## 016    2    thal_3 -0.48611215 2.241269e-19 2.465396e-18 0.644927536
## 02     3     exang -0.43675708 1.520814e-15 1.115263e-14 0.550724638
## 021    4   oldpeak -0.43069600 4.085346e-15 2.246941e-14 0.002811183
## 020    5   thalach  0.42174093 1.697338e-14 7.468286e-14 0.246633775
## 09     6   slope_2  0.39406637 1.068761e-12 3.918791e-12 0.253623188
## 08     7   slope_1 -0.36205330 8.142433e-11 2.559051e-10 0.659420290
## 04     8      cp_2  0.31674216 1.735460e-08 4.772515e-08 0.130434783
## 0      9       sex -0.28093658 6.678692e-07 1.632569e-06 0.826086957
## 011   10      ca_2 -0.27399787 1.280668e-06 2.817470e-06 0.224637681
## 03    11      cp_1  0.24587910 1.498284e-05 2.996568e-05 0.065217391
## 010   12      ca_1 -0.23241224 4.409069e-05 8.083294e-05 0.318840580
## 017   13       age -0.22543872 7.524801e-05 1.273428e-04 0.100357180
## 012   14      ca_3 -0.21061527 2.220536e-04 3.489414e-04 0.123188406
## 06    15 restecg_1  0.17532180 2.191648e-03 3.214418e-03 0.405797101
## 018   16  trestbps -0.14493113 1.154606e-02 1.587583e-02 0.238295303
## 014   17    thal_1 -0.10658897 6.388288e-02 8.267196e-02 0.086956522
## 05    18      cp_3  0.08695687 1.309784e-01 1.600847e-01 0.050724638
## 019   19      chol -0.08523911 1.387903e-01 1.607046e-01 0.445189639
## 07    20 restecg_2 -0.06841024 2.351175e-01 2.586292e-01 0.021739130
## 013   21      ca_4  0.06644102 2.488996e-01 2.607520e-01 0.007246377
## 01    22       fbs -0.02804576 6.267775e-01 6.267775e-01 0.159420290
##        avg_case note
## 015 0.787878788     
## 016 0.169696970     
## 02  0.139393939     
## 021 0.001033742     
## 020 0.280969267     
## 09  0.648484848     
## 08  0.296969697     
## 04  0.418181818     
## 0   0.563636364     
## 011 0.042424242     
## 03  0.248484848     
## 010 0.127272727     
## 017 0.093079734     
## 012 0.018181818     
## 06  0.581818182     
## 018 0.229260692     
## 014 0.036363636     
## 05  0.096969697     
## 019 0.429486353     
## 07  0.006060606     
## 013 0.024242424     
## 01  0.139393939

Challenge

Customize model hyperparameters and refit the ensemble

Read Chapter 9 and Chapter 10 in the Guide to SuperLearner to learn to customize model hyperparameters and test algorithms with multiple hyperparameter settings.

Refit the ensemble with a few new, tuned algorithms compared to the existing algorithms contained within the SL.library parameter inside of CV.SuperLearner above (“SL.rpart”, “SL.ranger”, “SL.glmnet”, “SL.xgboost”, “SL.mean”).

What changed, and how were you able to optimize performance?

How did you find out which hyperparameters to tune?

HINT: type ?rpart, ?ranger, etc.

Other tutorials

caret: https://topepo.github.io/caret/

tidymodels: https://www.tidymodels.org/

mikropml: http://www.schlosslab.org/mikropml/articles/paper.html

Part 2 - Image classification with Keras in R

(coming soon)